---
title: How Tokenizers Work
description: Tokenizers split large chunks of text into small, searchable units called tokens
canonical: https://docs.paradedb.com/documentation/tokenizers/overview
---

Before text is indexed, it is first split into searchable units called tokens.

The default tokenizer in ParadeDB is the [simple tokenizer](/documentation/tokenizers/available-tokenizers/simple). It splits
text on whitespace and punctuation, and also [lowercases](/documentation/token-filters/lowercase) each token. To see how a tokenizer splits a string,
cast the string to the tokenizer type, and then to `text[]`:

```sql
SELECT 'Hello world!'::pdb.simple::text[];
```

```ini Expected Response
     text
---------------
 {hello,world}
(1 row)
```

By contrast, the [ngram](/documentation/tokenizers/available-tokenizers/ngrams) tokenizer splits text into overlapping "grams" of `n` characters each. In this example, `n = 3`, with the two arguments setting the minimum and maximum gram size:

```sql
SELECT 'Hello world!'::pdb.ngram(3,3)::text[];
```

```ini Expected Response
                      text
-------------------------------------------------
 {hel,ell,llo,"lo ","o w"," wo",wor,orl,rld,ld!}
(1 row)
```

Choosing the right tokenizer is crucial to getting the search results you want. For instance, the simple tokenizer works best for whole-word matching like "hello" or "world", while the ngram tokenizer enables partial matching on fragments like "wor".
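
As a rough sketch of the difference, you can tokenize a short fragment and the full string with each tokenizer, then compare the resulting arrays with PostgreSQL's standard array containment operator `<@`. This only checks token overlap on the `text[]` casts shown above. It is an illustration, not a description of how a BM25 index evaluates queries.

```sql
-- Is every token of the fragment 'wor' also a token of 'Hello world!'?
SELECT
  'wor'::pdb.ngram(3,3)::text[] <@ 'Hello world!'::pdb.ngram(3,3)::text[] AS ngram_overlap,
  'wor'::pdb.simple::text[]     <@ 'Hello world!'::pdb.simple::text[]     AS simple_overlap;
```

Given the token lists above, `ngram_overlap` should be true because `wor` is one of the trigrams, while `simple_overlap` should be false because the simple tokenizer produces only `hello` and `world`.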

To configure a tokenizer for a column in the index, cast the column to the desired tokenizer type in the index definition:

```sql
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.ngram(3,3)))
WITH (key_field='id');
```
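
Once the index exists, searches on `description` use the tokens produced by the ngram tokenizer. The query below is a minimal sketch, assuming that `mock_items` has been populated and that the `|||` match operator from ParadeDB's querying documentation is available in your version. Neither is shown on this page, and `'sho'` is an arbitrary example fragment.

```sql
-- Assumption: `|||` is ParadeDB's match operator, and the query string is
-- expected to be tokenized with the same ngram tokenizer as the column.
SELECT id, description
FROM mock_items
WHERE description ||| 'sho';
```

If the column were indexed with the simple tokenizer instead, the same query would be expected to match only rows that contain `sho` as a whole word.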
